Retrieval Search Engine Pattern Matcher Pattern Training Phase

Authors

  • R. Chandrasekar
  • B. Srinivas
Abstract

In this paper, we describe a system called Glean, which is predicated on the idea that any coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to enhance the performance of Information Retrieval systems. We propose an approach to information retrieval that makes use of syntactic information obtained using a tool called a supertagger. A supertagger is used on a corpus of training material to semi-automatically induce patterns that we call augmented-patterns. We show how these augmented patterns may be used along with a standard Web search engine or an IR system to retrieve information, and to identify relevant information and filter out irrelevant items. We describe an experiment in the domain of official appointments, where such patterns are shown to reduce the number of potentially irrelevant documents by upwards of 80%.

AAAI Spring Symp. on NLP for the WWW, Stanford, March '97

Author footnote: On leave from the National Centre for Software Technology, Gulmohar Cross Road No. 9, Juhu, Bombay 400 049, India

Introduction: IR and WWW

Vast amounts of textual information are now available in machine-readable form, and a significant proportion of this is available over the World Wide Web (WWW). However, any particular user would typically be interested only in a fraction of the information available. The goal addressed by Information Retrieval (IR) systems and services in general, and by search engines on the Web in particular, is to retrieve all and only the information that is relevant to the query posed by a user.

Early information retrieval systems treated stored text as arbitrary streams of characters. Retrieval was usually based on exact word matching, and it did not matter if the stored text was in English, Hindi, Spanish, etc. Later IR systems treated text as a collection of words, and hence several new features were made possible, including the use of term expansion, morphological analysis, and phrase-indexing. However, all these methods have their limitations, and there have been several attempts to go beyond these methods. See (Salton & McGill 1983), (Frakes & Baeza-Yates 1992) for further details on work in information retrieval.

With the recent growth in activity on the Web, much more information has become accessible online. Several search engines have been developed to handle this explosion of information. These search engines typically explore hyperlinks on the Web, and index information that they encounter. All the information that they index thus becomes available to users' searches. As with most IR systems, these search engines use inverted indexes to ensure speed of retrieval, and the user is thus able to get pointers to potentially relevant information very fast. However, these systems usually offer only keyword-based searches. Some offer boolean searches, and features such as proximity and adjacency operators. Since the retrieval engines are geared to maximizing recall, there is little or no attempt to intelligently filter the information spewed out at the user. The user has to scan a large number of potentially relevant items to get to the information that she is actually looking for. Thus, even among experienced users of IR systems, there is a high degree of frustration experienced in searching for information on the Web.

Many of the (non-image) documents available on the Web are natural language (NL) texts. Since they are available in machine-readable form, there is a lot of scope for trying out different NL techniques on these texts. However, there has not been much work in applying these techniques to tasks such as information retrieval. In this paper, we describe an application which uses NL techniques to enhance retrieval. The system we describe is predicated on the fact that any coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to reduce an IR or Web users' information load.
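The retrieve-then-filter idea described above can be illustrated with a minimal Python sketch. This is not the Glean implementation: the toy documents, the function names, and the query are all hypothetical, and a plain regular expression stands in for an augmented pattern (real augmented patterns are induced from supertagged text, which this sketch does not attempt). It only shows how a recall-oriented keyword lookup over an inverted index can be followed by a pattern-based filter that discards irrelevant hits.

```python
import re
from collections import defaultdict

# Toy corpus (hypothetical data, for illustration only).
DOCS = {
    1: "Jane Doe was appointed chairman of Acme Corp.",
    2: "The appointed hour came and the director left the meeting.",
    3: "John Smith has been named as the new president of Widget Inc.",
    4: "The committee discussed the weather.",
}

def build_inverted_index(docs):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Recall-oriented retrieval: ids of documents containing ANY query word."""
    hits = set()
    for word in re.findall(r"[a-z]+", query.lower()):
        hits |= index.get(word, set())
    return hits

# Assumption: a hand-written regex as a crude stand-in for an augmented pattern.
# It requires an appointment verb directly followed by a post title.
APPOINTMENT_PATTERN = re.compile(
    r"\b(?:appointed|named)\s+(?:as\s+)?(?:the\s+)?(?:new\s+)?"
    r"(?:president|chairman|director)\b",
    re.IGNORECASE,
)

def filter_hits(docs, hits, pattern):
    """Keep only retrieved documents whose text matches the pattern."""
    return sorted(doc_id for doc_id in hits if pattern.search(docs[doc_id]))

INDEX = build_inverted_index(DOCS)
QUERY = "appointed named president chairman director"
retrieved = keyword_search(INDEX, QUERY)
relevant = filter_hits(DOCS, retrieved, APPOINTMENT_PATTERN)
print("retrieved:", sorted(retrieved))
print("after filtering:", relevant)
```

In this toy run, document 2 is retrieved because it contains the query words "appointed" and "director", but the filter drops it because those words do not occur in an appointment construction.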


Similar resources

A Deductive Pattern Matcher

This paper describes the design of a pattern matcher for a knowledge representation system called LOOM. The pattern matcher has a very rich pattern-forming language, and is logic-based, with a deductive mechanism which includes a truth-maintenance component as an integral part of the pattern-matching logic. The technology behind the LOOM matcher uses an inference engine called a classifier to pe...


A Multiple-Stage Framework for Related Entity Finding: FDWIM at TREC 2010 Entity Track

This paper describes a multiple-stage retrieval framework for the task of related entity finding in the TREC 2010 Entity Track. In the document retrieval stage, a search engine is used to improve the retrieval accuracy. In the entity extraction and filtering stage, we extract entities with NER tools, Wikipedia and text pattern recognition. Then a stoplist and other rules are employed to filter entities. De...


Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines

Purpose: Traditionally, there have been many metrics for evaluating search engines; nevertheless, various researchers have proposed new metrics in recent years. Awareness of these new metrics is essential for conducting research on search engine evaluation. So, the purpose of this study was to provide an analysis of important and new metrics for evaluating search engines. Methodology: This is ...


Learning Unsupervised SVM Classifier for Answer Selection in Web Question Answering

Previous machine learning techniques for answer selection in question answering (QA) have required question-answer training pairs. It has been too expensive and labor-intensive, however, to collect these training pairs. This paper presents a novel unsupervised support vector machine (USVM) classifier for answer selection, which is independent of language and does not require hand-tagged trainin...


A Complete Year of User Retrieval Sessions in a Social Sciences Academic Search Engine

In this paper, we present an open data set extracted from the transaction log of the social sciences academic search engine sowiport. The data set includes a filtered set of 484,449 retrieval sessions which have been carried out by sowiport users in the period from April 2014 to April 2015. We propose a description of the data set features that can be used as ground truth for different applicat...


Monolingual and Cross-language QA using a QA-oriented Passage Retrieval System

This report describes the work done by the RFIA group at the Departamento de Sistemas Informáticos y Computación of the Universidad Politécnica of Valencia for the 2005 edition of the CLEF Question Answering task. We participated in three monolingual tasks: Spanish, Italian and French, and in two cross-language tasks: Spanish to English and English to Spanish. Since this was our first participa...



Publication date: 1997